Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine the conversion of UTF-8 to bytes into a single base function #22703

Closed
wants to merge 18 commits into from

Conversation

khwilliamson
Copy link
Contributor

@khwilliamson khwilliamson commented Oct 26, 2024

There are two functions that do this currently, utf8_to_bytes and bytes_from_utf8 This combines the guts of their implementation.

It also creates 3 new public macros to call the base function with a somewhat different API that is more convenient to use, and can avoid mallocs when not needed

The first three commits in this series are needed as a base for the rest, but are already in individual pull requests.

  • This set of changes requires a perldelta entry, and it is not yet included.

Prior to this commit these were illegal.

This causes embed.fnc to generate macro 'Perl_foo' #defined to be  macro
'foo'.  This could be used to easily convert existing macros into having
long names should some become a name space pollution problem.

This also documents in embed.fnc, under the 'm' flag discussion, how to
use this instead of placing things in mathoms.c, or creating stub
functions.
Commit e4d3d0c removed all the calls to
this function.
Outdent and reflow some comments and code in preparation for them to be
moved out of the loop
This is for clarity.  All this very-unlikely-to-be-used code was in the
middle of what is really going on, creating a distraction.
The previous version did not make sure that it wasn't reading beyond the
end of the buffer in all cases, and the first pass through the input
string already ruled out it having most problems.  Thus we don't need
the full generality here of the macro UTF8_IS_DOWNGRADEABLE_START; and
this simplifies things
These were misleading.  On ASCII platforms, many calls to this function
won't use the per-word algorithm.  That's only done for long-enough
strings.
The new name, s0, is used in more other places for this meaning, and is
more descriptive.
This is an internal function, designed to be an extension of
utf8_to_bytes(), with a slightly different API.  This function just adds
it and calls it from just utf8_to_bytes.

Future commits will extend this API.
This adds a third return possibility to this new function.  Its sole
caller still treats it as a boolean for now.
This variable should not be being changed by the function
The argument is currently unused.  The macro is a public facing API that
calls this function with the correct argument
This makes the next commit smaller
This causes this function to be able to both overwrite the input, and to
instead create new memory.  It changes bytes_from_utf8() to use this new
capability instead of being a near duplication of the core code of this
function.

Prior to this commit, bytes_from_utf8() just allocated memory the size
of the original string, and started copying into it.  When it came to a
sequence that wasn't convertible, it stopped, and freed up the copy.
The new behavior has it checking first before the malloc that the string
is convertible.  That has the advantage that there is no malloc without
being sure it will be useful; but the disadvantage that there is an
extra pass through the input string, but that pass is per-word.

The next commit will introduce another advantage.
Prior to this commit, the size malloced was just the same as the length
of the input string, which is a worst case scenario.  This commit
changes so the new pass through the input (introduced in the previous
commit) also calculates the needed length.

The additional cost of doing this is minimal.  It has advantages on a
very long string with lots of sequences that are convertible.
This is a non-destructive conversion of the input into native bytes, and
with any new memory required set for destruction via SAVEFREEPV.  This
allows the caller to not have to be concerned at all if memory was
created or not.

A new macro is created that calls this internal function with the
correct parameter to force this behavior.
@khwilliamson
Copy link
Contributor Author

This pull request is going forward instead via #22638

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant